Goto

Collaborating Authors

 Agent Societies


MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations

Neural Information Processing Systems

We study offline imitation learning (IL) in cooperative multi-agent settings, where demonstrations have unlabeled mixed quality -- containing both expert and suboptimal trajectories. Our proposed solution is structured in two stages: trajectory labeling and multi-agent imitation learning, designed jointly to enable effective learning from heterogeneous, unlabeled data. In the first stage, we combine advances in large language models and preference-based reinforcement learning to construct a progressive labeling pipeline that distinguishes expert-quality trajectories. In the second stage, we introduce MisoDICE, a novel multi-agent IL algorithm that leverages these labels to learn robust policies while addressing the computational complexity of large joint state-action spaces. By extending the popular single-agent DICE framework to multi-agent settings with a new value decomposition and mixing architecture, our method yields a convex policy optimization objective and ensures consistency between global and local policies. We evaluate MisoDICE on multiple standard multi-agent RL benchmarks and demonstrate superior performance, especially when expert data is scarce.


APrinciple of Targeted Intervention for Multi-Agent Reinforcement Learning

Neural Information Processing Systems

Steering cooperative multi-agent reinforcement learning (MARL) towards desired outcomes is challenging, particularly when the global guidance from a human on the whole multi-agent system is impractical in a large-scale MARL. On the other hand, designing external mechanisms (e.g., intrinsic rewards and human feedback) to coordinate agents mostly relies on empirical studies, lacking a easy-to-use research tool. In this work, we employ multi-agent influence diagrams (MAIDs) as a graphical framework to address the above issues. First, we introduce the concept of MARL interaction paradigms (orthogonal to MARL learning paradigms), using MAIDs to analyze and visualize both unguided self-organization and global guidance mechanisms in MARL. Then, we design a new MARL interaction paradigm, referred to as the targeted intervention paradigm that is applied to only a single targeted agent, so the problem of global guidance can be mitigated. In implementation, we introduce a causal inference technique--referred to as Pre-Strategy Intervention (PSI)--to realize the targeted intervention paradigm. Since MAIDs can be regarded as a special class of causal diagrams, a composite desired outcome that integrates the primary task goal and an additional desired outcome can be achieved by maximizing the corresponding causal effect through the PSI. Moreover, the bundled relevance graph analysis of MAIDs provides a tool to identify whether an MARL learning paradigm is workable under the design of an MARL interaction paradigm. In experiments, we demonstrate the effectiveness of our proposed targeted intervention, and verify the result of relevance graph analysis.


Regret Lower Bounds for Decentralized Multi-Agent Stochastic Shortest Path Problems

Neural Information Processing Systems

Multi-agent systems (MAS) are central to applications such as swarm robotics and traffic routing, where agents must coordinate in a decentralized manner to achieve a common objective. Stochastic Shortest Path (SSP) problems provide a natural framework for modeling decentralized control in such settings. While the problem of learning in SSP has been extensively studied in single-agent settings, the decentralized multi-agent variant remains largely unexplored. In this work, we take a step towards addressing that gap. We study decentralized multi-agent SSPs (Dec-MASSPs) under linear function approximation, where the transition dynamics and costs are represented using linear models. Applying novel symmetry-based arguments, we identify the structure of optimal policies. Our main contribution is the first regret lower bound for this setting based on the construction of hard-tolearn instances for any number of agents, n. Our regret lower bound of โ„ฆ( K), over K episodes, highlights the inherent learning difficulty in Dec-MASSPs. These insights clarify the learning complexity of decentralized control and can further guide the design of efficient learning algorithms in multi-agent systems.


i.e., Policyi.e., Orchestratei.e., Agenti.e., Reinforcing Puppeteer Manupilate Puppet Environment Multi-Agent Collaboration via Evolving Orchestration

Neural Information Processing Systems

Large language models (LLMs) have achieved remarkable results across diverse downstream tasks, but their monolithic nature restricts scalability and efficiency in complex problem-solving. While recent research explores multi-agent collaboration among LLMs, most approaches rely on static organizational structures that struggle to adapt as task complexity and agent numbers grow, resulting in coordination overhead and inefficiencies. To this end, we propose a puppeteer-style paradigm for LLM-based multi-agent collaboration, where a centralized orchestrator ("puppeteer") dynamically directs agents ("puppets") in response to evolving task states. This orchestrator is trained via reinforcement learning to adaptively sequence and prioritize agents, enabling flexible and evolvable collective reasoning. Experiments on closed-and open-domain scenarios show that this method achieves superior performance with reduced computational costs. Analyses further reveal that the key improvements consistently stem from the emergence of more compact, cyclic reasoning structures under the orchestrator's evolution.


Oryx: a Scalable Sequence Model for Many-Agent Coordination in Offline MARL

Neural Information Processing Systems

A key challenge in offline multi-agent reinforcement learning (MARL) is achieving effective many-agent multi-step coordination in complex environments. In this work, we propose Oryx, a novel algorithm for offline cooperative MARL to directly address this challenge. Oryx adapts the recently proposed retention-based architecture Sable (Mahjoub et al., 2025) and combines it with a sequential form of implicit constraint Q-learning (ICQ) (Yang et al., 2021), to develop a novel offline autoregressive policy update scheme. This allows Oryx to solve complex coordination challenges while maintaining temporal coherence over long trajectories. We evaluate Oryx across a diverse set of benchmarks from prior works--SMAC, RWARE, and Multi-Agent MuJoCo--covering tasks of both discrete and continuous control, varying in scale and difficulty. Oryx achieves state-of-the-art performance on more than 80% of the 65 tested datasets, outperforming prior offline MARL methods and demonstrating robust generalisation across domains with many agents and long horizons. Finally, we introduce new datasets to push the limits of many-agent coordination in offline MARL, and demonstrate Oryx's superior ability to scale effectively in such settings.


Federated Multi-armed Bandits with Efficient Bit-Level Communications

Neural Information Processing Systems

In this work, we study the federated multi-armed bandit (FMAB) problem, where a set of agents collaboratively aim to minimize cumulative regret. Unlike traditional centralized bandit models, agents in FMAB settings are connected via a communication graph and cannot share data freely due to bandwidth limitations or privacy constraints. This raises a fundamental challenge: how to achieve optimal learning performance under stringent communication budgets. We propose a novel communication-efficient algorithm containing two points: one for eliminating suboptimal arms through early and frequent communication of key decisions, and the other for refining global estimates using incremental epoch, quantized, and differentially transmitted statistics. Incremental Epoch-based Successive Elimination Algorithm (EpoInc-SE) is presented by carefully balancing communication frequency and precision of global estimates. Theoretically, we derive tight upper bounds on both individual cumulative regret and group regret, and prove that our method asymptotically matches the lower bound of regret in federated settings.


Multi-Agent Reinforcement Learning with Communication-Constrained Priors

Neural Information Processing Systems

Communication is one of the effective means to improve the learning of cooperative policy in multi-agent systems. However, in most real-world scenarios, lossy communication is a prevalent issue. Existing multi-agent reinforcement learning with communication, due to their limited scalability and robustness, struggles to apply to complex and dynamic real-world environments. To address these challenges, we propose a generalized communication-constrained model to uniformly characterize communication conditions across different scenarios. Based on this, we utilize it as a learning prior to distinguish between lossy and lossless messages for specific scenarios. Additionally, we decouple the impact of lossy and lossless messages on distributed decision-making, drawing on a dual mutual information estimatior, and introduce a communication-constrained multi-agent reinforcement learning framework, quantifying the impact of communication messages into the global reward.


The Price of Opportunity Fairness in Matroid Allocation Problems

Neural Information Processing Systems

We consider matroid allocation problems under opportunity fairness constraints: resources need to be allocated to a set of agents under matroid constraints (which include classical problems such as bipartite matching). Agents are divided into C groups according to a sensitive attribute, and an allocation is opportunity-fair if each group receives the same share proportional to the maximum feasible allocation it could achieve in isolation. We study the Price of Fairness (PoF), i.e., the ratio between maximum size allocations and maximum size opportunity-fair allocations. We first provide a characterization of the PoF leveraging the underlying polymatroid structure of the allocation problem. Based on this characterization, we prove bounds on the PoF in various settings from fully adversarial (worst-case) to fully random. Notably, one of our main results considers an arbitrary matroid structure with agents randomly divided into groups. In this setting, we prove a PoF bound as a function of the (relative) size of the largest group. Our result implies that, as long as there is no dominant group (i.e., the largest group is not too large), opportunity fairness constraints do not induce any loss of social welfare (defined as the allocation size). Overall, our results give insights into which aspects of the problem's structure affect the trade-off between opportunity fairness and social welfare.


In-Context Fully Decentralized Cooperative Multi-Agent Reinforcement Learning

Neural Information Processing Systems

In this paper, we consider fully decentralized cooperative multi-agent reinforcement learning, where each agent has access only to the states, its local actions, and the shared rewards. The absence of information about other agents' actions typically leads to the non-stationarity problem during per-agent value function updates, and the relative overgeneralization issue during value function estimation. However, existing works fail to address both issues simultaneously, as they lack the capability to model the agents' joint policy in a fully decentralized setting. To overcome this limitation, we propose a simple yet effective method named Return-Aware Context (RAC). RAC formalizes the dynamically changing task, as locally perceived by each agent, as a contextual Markov Decision Process (MDP), and addresses both nonstationarity and relative overgeneralization through return-aware context modeling. Specifically, the contextual MDP attributes the non-stationary local dynamics of each agent to switches between contexts, each corresponding to a distinct joint policy.


Assignments for Congestion-Averse Agents: Seeking Competitive and Envy-Free Solutions

Neural Information Processing Systems

We investigate congested assignment problems where agents have preferences over both resources and their associated congestion levels. These agents are averse towards congestion, i.e., consistently preferring lower congestion for identical resources. Such scenarios are ubiquitous across domains including traffic management and school choice, where fair resource allocation is essential. We focus on the concept of competitiveness, recently introduced by Bogomolnaia and Moulin [6], and contribute a polynomial-time algorithm that determines competitiveness, resolving their open question. Additionally, we explore two optimization variants of congested assignments by examining the problem of finding envy-free or maximally competitive assignments that guarantee a certain amount of social welfare for every agent, termed top-guarantees [6]. While we prove that both problems are NP-hard, we develop parameterized algorithms with respect to the number of agents or resources.